Results

Factored eQTL models on 48 GTEx tissues

Bayesian sparse prior distribution strongly controls false discovery rate of factored regression model

## WARN [2018-08-09 14:17:53] l2 norm provided. Howewer matrix will be converted to binary (0,1) automatically. 'jaccard' can be computed only on sets which should be encoded as sparse matrices of 0, 1.

We trained the factored eQTL model gene by gene on 18,380 protein coding genes located in autosomal chromosomes, imposing spike-slab prior distribution on the regression coefficients1,2. This Bayesian approach can effectively handle multi-collinearity problem between SNPs within a linkage disequilibrium block. For each gene, we combine gene expression of 48 tissues, after tissue by tissue confounder correction, and construct an individual by tissue matrix and fit our factored regression model including all genotype information within \(\pm\) 1Mb window. We considered a gene is heritable if the maximum value of posterior inclusion probability (PIP) of the SNP effects is greater than \(0.5\). After the hyper-parameter tuning, the estimates of PIP values tend to extreme values, either 0 or 1. We estimated through sample permutation this PIP cutoff corresponds to p-value \(<\) 9.9e-06 and empirical false discovery rate3,4 \(<\) 6.5e-04.

Factored regression model provides parsimonious description of multi-tissue patterns

Of 18,380 protein coding genes, we found 3,139 genes are strongly heritable with respect to our factored eQTL model in GTEx (17.08%). Of them, as there can be multiple factors within a gene, we identified 3,538 independent genetic factors and used them to investigate genetic correlation with 114 GWAS statistics (on average 1.13 genetic factors per gene). Within each gene the factored QTL model combines the regulatory mechanisms shared on average 9.09 tissues and reduce model complexity of the multivariate regression model. Each factor pools information of average 7.87 tissues. Yet, it is interesting to note that our model estimation is highly skewed toward a single tissue model (Fig. ), rather than the characteristic bimodal shape of the univariate eQTL models previously reported by the GTEx consortium5,6.

\label{fig:tis.freq}__Factored QTL models find tissue-specific genetic regulatory mechanisms.__  _Left_: Frequency of the number of active tissues per factor; _Right_: Frequency of the number of active tissues per gene.

Factored QTL models find tissue-specific genetic regulatory mechanisms. Left: Frequency of the number of active tissues per factor; Right: Frequency of the number of active tissues per gene.

\label{fig:tis.sample.size}__Statistical power of factored QTL model depends on samples sizes.__ The _colors_ indiciate different types of tissues.

Statistical power of factored QTL model depends on samples sizes. The colors indiciate different types of tissues.

The statistical power of eQTL discovery is limited to the underlying sample sizes (Fig. ). This is expected as our models specifically highlight tissue-specificity, rather than ubiquitous patterns shared across all tissues. Accumulating all the tissue factors we can create similarity matrix between tissues (Fig. ). We measure inter-tissue similarity by Jaccard coefficients. Although this type of view averaged over all genes is limited in resolution, we find strong similarity between the brain tissues; gene-regulatory mechanisms in the other tissues with similar physiological characters tend to cluster together (e.g., between “Artery Tibial” and “Artery Aorta”).

\label{fig:tis.jaccard}__Average similarity patterns across tissues spearate the brain tissues from others.__

Average similarity patterns across tissues spearate the brain tissues from others.

2,628 genes are regulated by multiple SNPs

In the sparse multivariate eQTL analysis we estimate the expression of 2,628 genes are regulated by more than one QTL SNPs (83.72%). We also find 2,862 factors are regulated by multiple SNPs (80.89%). Since the variational inference2 with the spike-slab prior1 generally tags fewer number of causal (up to linkage disequilibrium) than the actual number of independent causal variants, we may be able to consider that nearly all genes are regulated by multiple independent causal SNPs. In our simulation experiments on small set of GTEx data7, we found our factored QTL approach outperform most existing methods, especially when genetic regulatory mechanisms are shared across multiple tissues. On average we identify 7.59 causal SNPs per gene; 6.77 per factor. The degree distributions of QTLs per gene and factor seem to follow a power law distribution with fat heavy tails (Fig. ).

\label{fig:snp.freq}__Degree distribution of causal eQTLs follow the power law distribution.__

Degree distribution of causal eQTLs follow the power law distribution.

Our factored regression tends to include more SNPs to describe more tissues with correlation 0.82 at the factor-level and correlation 0.83 at the gene-level (Fig.). These correlations were calculated between the log-transformed SNP and tissue degrees. As for thorough biological interpretations, a spare regression models with relevant epigenomic and other pathway-level annotations will be greatly beneficial8.

\label{fig:snp.tis.deg.cor}__SNP and tissue degrees are correlated within factor-level and genel-level models.__

SNP and tissue degrees are correlated within factor-level and genel-level models.

Transcriptome-wide association studies (TWAS) on 114 GWAS summary statistics

Genetic regulation of 1,214 genes are associated with 114 GWAS traits

We tested genetic correlation between the multivariate eQTL effects and GWAS summary statistics using summary statistics-based TWAS9,10 on each trait independently. After applying multiple hypothesis correction of 3,467 tests in each trait, with p-value threshold 1.4e-05, we discovered 1,252 factors of 1,214 genes as a potential regulator of these GWAS traits.

\label{fig:twas.trait.degree}

These TWAS genes are pleiotropic along the GWAS trait and GTEx tissue axes. On average, they are associated with 5.43 traits and active in 12.76 tissues at the factor-level; 5.56 traits and active in 13.14 tissues. However we note that the degree of gene-level and factor-level pleiotropy is bounded and empirically the probability decreases exponentially (Fig.).

We also note that the degrees of tissues and GWAS traits are weakly correlated with 0.36 at the factor-level and 0.37 at the gene-level (Fig.). In other words, this results suggest that multi-tissue genes are generally pleiotropic in TWAS; therefore, when it comes to the interpretation GWAS using eQTL data, multi-tissue genes may provide sufficient level of resolutions.

\label{fig:tis.trait.deg.cor}__Tissue and trait degrees are weakly correlated within factor-level and genel-level models.__

Tissue and trait degrees are weakly correlated within factor-level and genel-level models.

Trait-trait correlation of TWAS associations reveal distinctive gene-level pleiotropy of the blood and metabolic traits

Observing the maximum degree of trait pleiotropy within a gene is limited (\(\le\) 39), we seek to understand a global architecture of 1,252 the multi-tissue genetic factors based on the trait-sharing patterns by clustering columns of a giant trait by factor binary matrix \(F\) (Supp.Fig.), where \(F_{tg} = 1\) if and only if genetic factor \(g\) is significantly associated with trait \(t\), otherwise \(F_{tg} = 0\). First we investigate overall similarity between traits (Fig.). Between trait \(t\) and \(s\) we measure Jaccard similarity \(J_{ts}=\sum_{g}F_{tg}F_{sg}/(\sum_{g}F_{tg} + \sum_{g}F_{sg} - \sum_{g}F_{tg}F_{sg})\). One of the most salient patterns capture a group of blood cell count traits11. Also, we notice that different body mass index (BMI) of multiple studies are tightly grouped together. Since this trait-trait correlation matrix simply takes average over all genes, it dose not provide fine-resolution pictures of gene-level pleiotropic patterns.

## WARN [2018-08-09 14:18:05] l2 norm provided. Howewer matrix will be converted to binary (0,1) automatically. 'jaccard' can be computed only on sets which should be encoded as sparse matrices of 0, 1.
\label{fig:trait.trait.jaccard}__Trait-trait correlation of TWAS associations reveal distinctive pleiotropic clusters.__

Trait-trait correlation of TWAS associations reveal distinctive pleiotropic clusters.

1,252 factors are clustered into 126 groups according to factor-specific multi-trait association patterns

We cluster TWAS genes (genetic factors) based on factor-specific multi-trait association patterns. We treat each column the TWAS feature matrix \(F\) as a set of associated traits, we can measure similarity between them by Jaccard coefficients. For instance of genetic factors \(g\) and \(g'\), \(J_{gg'}=\sum_{t}F_{tg}F_{tg'}/(\sum_{t}F_{tg} + \sum_{t}F_{tg'} - \sum_{t}F_{tg}F_{tg'})\). We used fast Rcpp implementation of text2vec package (see http://text2vec.org for details). We constructed a weighted graph connecting genetic factors, retaining only top 5% of interactions to reduce weak and noisy interactions (with the Jaccard coefficient cutoff = 0.12). We weight the retained interactions by the fold change of Jaccard coefficient with respect to the baseline cutoff (see Fig. for its adjacency matrix). We then fitted a hierarchical stochastic block model with Poisson distribution on the network to identify distinctive groups of the genetic factors.

\label{fig:twas.cluster}__Clustering of genetic factors according to multi-trait TWAS association patterns reveals modular structure of TWAS genes.__ _1st row_: Number of SNPs used in the factored eQTL models for each cluster. _2nd row_: Number of genes in each cluster. _3rd row_: Cluster-specific average trait frquency matrix (trait rows and cluster columns)

Clustering of genetic factors according to multi-trait TWAS association patterns reveals modular structure of TWAS genes. 1st row: Number of SNPs used in the factored eQTL models for each cluster. 2nd row: Number of genes in each cluster. 3rd row: Cluster-specific average trait frquency matrix (trait rows and cluster columns)

Roughly each gene cluster is associated with average 5.21 traits.

\label{fig:}

Limited tissue-specificity

Supplementary materials

Supplementary figures

\label{fig.supp:twas.feature.mat}__TWAS__

TWAS

\label{fig.supp:twas.network}__TWAS Network__

TWAS Network

References

1. Mitchell, T. J. & Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. J. Am. Stat. Assoc. 83, 1023–1032 (1988).

2. Carbonetto, P. & Stephens, M. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 73–108 (2012).

3. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100, 9440–9445 (2003).

4. Storey, J. D. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat. 31, 2013–2035 (2003).

5. GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).

6. GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).

7. Park, Y., Sarkar, A. K., Bhutani, K. & Kellis, M. Multi-tissue polygenic models for transcriptome-wide association studies. bioRxiv 107623 (2017).

8. Carbonetto, P. & Stephens, M. Integrated enrichment analysis of variants and pathways in Genome-Wide association studies indicates central role for IL-2 signaling genes in type 1 diabetes, and cytokine signaling genes in crohn’s disease. PLoS Genet. 9, e1003770 (2013).

9. Mancuso, N. et al. Integrating Gene Expression with Summary Association Statistics to Identify Genes Associated with 30 Complex Traits. Am. J. Hum. Genet. 100, 473–487 (2017).

10. Gusev, A. et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat. Genet. 50, 538–548 (2018).

11. Astle, W. J. et al. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell 167, 1415–1429.e19 (2016).